We thank the ACL/DCI (Data Collection Initiative), the Collins Publishing Company, and the Wall Street Journal for providing invaluable on-line data, and the Treebank Project for providing a tagged corpus for reference. Pre-processing: The Big Picture Re-processing up Against?

Author

  • Uri Zernik
Abstract

Thematic analysis is best manifested by contrasting collocations such as “shipping pacemakers” vs. “shipping departments”. While in the first pair the pacemakers are being shipped, in the second the departments are probably engaged in some shipping activity but are not themselves being shipped. Text pre-processors, intended to inject corpus-based intuition into the parsing process, must adequately distinguish between such cases. Although statistical tagging [Church et al., 1989; Meteer et al., 1991; Brill, 1992; Cutting et al., 1992] has attained impressive results overall, the analysis of multiple-content-word strings (i.e., collocations) has remained a weakness and has degraded accuracy. To provide acceptable coverage (i.e., 90% of collocations), a tagger must have access to a large database (i.e., 250,000 pairs) of individually analyzed collocations. Consequently, training must be based on a corpus of well over 50 million words. Since no corpus of that size exists in tagged form, training must be done from a raw corpus. In this paper we present an algorithm for text tagging based on thematic analysis. The algorithm yields high-accuracy results. We provide empirical results: the program NLcp (NL corpus processing) acquired a database of 250,000 thematic relations from the 85-million-word Wall Street Journal corpus. It was tested on the Tipster 66,000-word Joint-Venture corpus.
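The contrast drawn in the abstract can be pictured as a lookup in a table of analyzed collocations. The sketch below is only an illustration of that idea and is not the NLcp algorithm: the table entries, the relation labels (`verb-object`, `modifier-noun`), the coarse tags, and the function names are all hypothetical, chosen to mirror the two example pairs.

```python
from typing import Dict, Optional, Tuple

# Hypothetical toy database of analyzed collocations. The labels are
# assumptions made for illustration, not the relations stored by NLcp.
#   "verb-object"   : the second word is the thing being acted upon
#   "modifier-noun" : the first word names an activity of the second word
THEMATIC_DB: Dict[Tuple[str, str], str] = {
    ("shipping", "pacemakers"): "verb-object",
    ("shipping", "departments"): "modifier-noun",
}


def thematic_relation(first: str, second: str) -> Optional[str]:
    """Return the stored relation for a word pair, or None if it is unseen."""
    return THEMATIC_DB.get((first.lower(), second.lower()))


def tag_pair(first: str, second: str) -> Tuple[str, str]:
    """Choose coarse tags for a two-word collocation from its stored relation."""
    relation = thematic_relation(first, second)
    if relation == "verb-object":
        return ("VERB", "NOUN")
    if relation == "modifier-noun":
        return ("MODIFIER", "NOUN")
    # Pairs missing from the database are left undecided in this sketch.
    return ("UNKNOWN", "NOUN")


if __name__ == "__main__":
    for pair in [("shipping", "pacemakers"), ("shipping", "departments")]:
        print(pair, "->", tag_pair(*pair))
```

A two-entry table makes the coverage problem concrete: any pair missing from it falls through to the undecided case, which is why the abstract argues for a database on the order of 250,000 pairs.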


Similar resources

Text Analytics of Customers on Twitter: Brand Sentiments in Customer Support

Brand community interactions and online customer support have become major platforms for strengthening brand sentiment and creating loyalty. Rapid brand responses to each customer request through inbound tweets on Twitter, and taking proper actions to cover the needs of customers, are the key elements of positive brand sentiment creation and product or service initiative management in the realm of ...

Corpus based coreference resolution for Farsi text

"Coreference resolution", or "finding all expressions that refer to the same entity" in a text, is one of the important requirements in natural language processing. Two words are coreferent when both refer to a single entity in the text or the real world. So the main task of coreference resolution systems is to identify terms that refer to a unique entity. A coreference resolution tool could be...

Information Technology for Project Cost Management (Case study: Soufian Cement Co, Iran)

In today's competitive world, reducing production costs has become one of the priorities of corporations. The survival triangle (cost, quality and time) is the solution that helps companies focus on these three dimensions and gives them the ability to compete with other companies. Cost management is the first step in this direction, providing solutions and advice to managers who need help to have a precise ...

2016 Olympic Games on Twitter: Sentiment Analysis of Sports Fans Tweets using Big Data Framework

Big data analytics is one of the most important subjects in computer science. Today, due to the increasing expansion of Web technology, a large amount of data is available to researchers. Extracting information from these data is one of the requirements for many organizations and business centers. In recent years, the massive amount of Twitter's social networking data has become a platform for ...

Design and Test of the Real-time Text mining dashboard for Twitter

One of today's major research trends in the field of information systems is the discovery of implicit knowledge hidden in datasets that are currently being produced at high speed, in large volumes, and in a wide variety of formats. Data with such features is called big data. Extracting, processing, and visualizing this huge amount of data has today become one of the concerns of data science scholar...

Publication date: 1999